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Abstract 

We propose a novel approach for density estimation called histogram trend fil¬ 
tering. Our estimator arises from looking af surrogafe Poisson model for counfs of 
observafions in a parfifion of fhe supporf of fhe dafa. We begin by showing consis- 
fency for a variafional estimator for fhis densify estimation problem. We fhen sfudy 
a discrefe esfimafor fhaf can be efficienfly found via convex opfimizafion. We show 
fhaf fhe esfimafor enjoys sfrong sfafisfical guarantees, yef is much more pracfical 
and compufafionally efficienf fhan ofher esfimafors fhaf enjoy similar guaranfees. 
Finally, in our simulation sfudy fhe proposed mefhod showed smaller averaged 
mean square error fhan compefing mefhods. This favorable blend of properfies 
makes histogram frend filtering an ideal candidafe for use in roufine dafa-analysis 
applications fhaf call for a quick, efficienf, accurate densify esfimafe. 

Key words: densify esfimafion, penalized likelihood, frend filtering, Sobolev spaces 


1 Introduction 

1.1 Nonparametric density estimation 

Consider the classic problem of one-dimensional density estimation, where we observe 
Hi ^ f for i = 1,... ,n and wish to estimate /. Most data-analysis practitioners that 
confront this problem turn to kernel density estimation, due to its familiarity, its compu¬ 
tational efficiency, and its well-understood statistical properties. Yet kernel methods are 
known to suffer from the local-adaptivity problem, wherein the used of a fixed band¬ 
width parameter may result in simultaneously undersmoothing and oversmoothing in 
different regions of the density. 
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A huge variety of methods have been proposed that improve upon basic kernel 
methods in a way that addresses this problem, from adaptive kernel bandwidths to 
penalized-likelihood estimation. Yet these methods typically either incur a much higher 
computational burden than basic kernel methods, or else they involve hyperparameters 
that are difficult to specify and tune. The goal of this paper is address this gap. We 
propose a method called histogram trend filtering, which solves the adaptivity problem 
while simultaneously satisfying all three of the following criteria: 

1. It is computationally efficient, even for large data sets. 

2. It works out of the box, with no user-specified tuning parameters. 

3. It has strong statistical guarantees. 

These three factors make our proposed method a strong candidate to replace ordinary 
kernel density estimation as the default "first pass" for data-analysis practitioners. 

The histogram trend-filtering estimator is related to the following variational opti¬ 
mization problem based on penalizing the log likelihood g{x) = log f{x): 


minimize 

9 

subject to 


n 

- ^9{yi) 

i=l 



Jig) < t, 


( 1 ) 


where J{g) is a known penalty functional. Imposing an appropriate penalty can encour¬ 
age smoothness and avoids estimates that are sums of point masses. 

Specifically, we consider solutions to Q for penalties based on total variation, as pro¬ 
posed by Koenker and Mizera (2007 >. We provide conditions under which explicit rates 
of convergence can be obtained for these estimators. We also study a finite-dimensional 
version of this variational problem—^histogram trend filtering—which involves two con¬ 
ceptually simple steps. First, partition the observations into Dn histogram bins with 
centers < ••• < and counts xi,... ,XDn- Then assume the surrogate model 
Poisson(Aj) and estimate the A/s via polynomial trend filtering (|Kim et al. 2009 


Tibshirani ||2014[| applied to the Poisson likelihood. The renormalized A/s then may be 


used to form an estimate of /q. 

Our results show that this simple, computationally efficient procedure yields excel¬ 
lent performance for density estimation. Our main theorems characterize how the opti¬ 
mal bin size must shrink as a function of n to ensure consistency for estimating /o, and 
provide bounds on the proposed procedure's reconstruction error under the assumption 
that the bins are chosen accordingly. Our empirical results also show that the histogram 
trend-filtering estimator is adaptive to changes in smoothness of the underlying density 
when familiar information criteria are used to choose the method's single tuning param¬ 
eter. Put simply, in can yield an estimate that is simultaneously smooth in some regions 
and spiky in others. This behavior contrasts favorably with kernel density estimation, 
where the bandwidth parameter governs the global smoothness of the estimate. 


2 















1.2 Histogram trend filtering 


The idea of histogram trend filtering is to reduce the density estimation problem to that 
of a nonparametric Poisson regression problem, which is solved by trend filtering (|Kim 


et al. 2009 Tibshirani and Taylor[[20TT). The method is so computationally efficient for 


two reasons: (1) because binning the data results in a huge reduction from data points 
to bin counts, and (2) because the trend-filtering estimator for a Poisson regression can 


be obtained so cheaply, using the extraordinarily fast ADMM algorithm of Ramdas and 


Tibshirani (2014|. An important point for us to demonstrate is that the data reduction 


step can be done without losing too much information; we address this concern later. 

Let us now construct in detail the histogram trend-filtering estimator, which can be 
viewed as a discrete approximation to Problem Q when J penalizes the total variation 
of g or higher-order versions thereof. We begin with several assumptions made for ease 
of exposition. Let X <zTZ denote the support of /q. Suppose that A is a compact set that 
it is partitioned into Dn disjoint intervals Ij with midpoints such that IJ^ Ij = X. We 
assume that the intervals are of equal length Sn and ordered so that < • • • < Any 
of these assumptions can be relaxed in practice. 

Now consider a histogram of the observations using bins Ij. Let xj = #{xi G Ij} 
denote the histogram count for bin j, and consider the surrogate model 


Xj ~ Poisson(Aj) , Xj = n6nf{Cj) fo{y) dy . (2) 

Jij 

Let dj = log Xj be the log rate parameter for bin j, let 0 = {0i,... ,6 d„), and define the 
loss function 

Dn 

'(«) = E b"'- 

i=i 

as the negative log likelihood corresponding to Model Q. We propose to estimate 9 
using the solution to the unconstrained optimization problem 


minimize 1(9) + t\\A^’^+^'>9\\p , 
6»e7^o ^ 


(3) 


where is the discrete difference operator of order k. Concretely, when A: = 0, 

is the matrix encoding the first differences of adjacent values: 


A(i) 


/ 1 -1 0 0 
0 1-10 


0 \ 
0 


\ 0 ••• 0 1 -1 / 


(4) 


For k > 1 this matrix is defined recursively as = A^^^ where A^^^ from (|^ is 

of the appropriate dimension. 

We focus on problem (|^ when q = p = 1, which corresponds to the polynomial 
trend-filtering estimator under a Poisson likelihood. Intuitively, the trend-filtering es- 
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timator is similar to an adaptive spline model: it places a lasso penalty on a discrete 
analogue of the order-fe derivative of the underlying log-density, resulting in a piece- 
wise polynomial estimate whose degree depends on k. Trend filtering has been studied 
extensively in the context of function estimation, generalized linear models, and graph 
denoising ( |Kim et ^ 2009 Tibshirani and Taylor[ 2011| Wang et al. 2014| |. 

The goal of this paper is to understand the statistical properties of this method as an 
approach to density estimation, and therefore we do not discuss details of implemen¬ 
tation. However, we note that problem (|3jl can be solved efficiently when q = p = 1 
using the augmented-Lagrangian method from Ramdas and Tibshirani ( |2014| |, as imple¬ 
mented in the glmgen R package (Arnold et al. 2014| |. When q = p = 2, the objective is 
differentiable, and any standard gradient-based or quasi-Newton optimization method 
may be used. 


2 Connections with previous work 


In this section we present a brief review of density estimator related to our methods. 
We begin by discussing the seminal work from Good and Gaskins ( 1971[ > which can be 
motivated from a Bayesian perspective. This starts by considering the prior 


p{f) (X exp (-4'(/)) I (/ G A) 

where $ is a roughness penalty and A is some class of density functions. Then, given 
the usual likelihood 


P{x I /) = Ylfixi), 


2 = 1 


the authors in Good and Gaskins ( 1971| produce a maximum a posteriori (MAP) esti¬ 
mate of /o by solving 


/ = argmin^g_^ - logp{x \ f) + 4>(/) 


(5) 


This is the main focus of study in Good and Gaskins ( [1971 1, where one of the choices of 
roughness penalty is proportional to Fisher's information concerning the displacement 
or location, regarded as a parameter: 


r 

Hf) = / 

J —c 


{nx)f 

fix) 


dx 


( 6 ) 


The consistency properties of the estimator ^ were briefly studied in [Good arid 


Gaskins| (|1971|, where the authors showed convergence in the sense of probability of 


integrals of the form 


f{x)dx 


foix)dx. 
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However, this does not imply the absence of false bumps: they could become small and 
numerous as n increases. A more complete characterization of the estimator / with the 
choice of penalty ^ was given in de Montricher et al. ( 1975| . There, the main result 
is the proof of the existence and uniquenes of /. Moreover, the result holds with more 
generality allowing the class functions A to be a reproducing Hilbert space, and stating 
that if T' is the square of the norm of such reproducing space, then / exists and it is 
unique. 

In a variation of the estimator from Good and Gaskins ( 1971| , Silverman] ( |1982| | 
works within the framework of roughness penalty. However, rather than penalties di¬ 
rectly imposing constraints on the density space, Silverman ( 1982| proposes to penalize 
the log density. This is immediately attractive since it automatically imposes a positive 
constraint in the estimates, with the formal formulation given as 


-^Er=i5(A) + ^A$(<7) 


minimize 

9 

subject to f = 1. 


(7) 


The roughness penalties studied in [Silverman] ( [19^ > are of the form 

m t [Digmf d(p) 

Jo 


where D{g) is a function of the first m derivatives of g, see Silverman ( ]1982 | for the 
specific construction. There, Theorem 3.1 also shows that Q is equivalent to the uncon¬ 
strained problem 


minimize- 

9 n 


i=i 




( 8 ) 


This alternative formulation has the nice feature that can be formulated as a convex 
optimization problem, see O'Sullivan ( ]1988 |. 

It turns out that a similar result can easily be proven for our Poisson surrogate prob¬ 
lem. This is given in the following Theorem. 


Theorem 1. With the notation from Section 1.2 it can he proven that there exists a constant 
c > 0 such that 9 solves O if only if gi = 9i — log(n 6n) solves 


minimize — 


subject to X]i=i = 1, < c. 


(9) 


Thus, we have shown that our Poisson surrogate problem is indeed a discretization 
of problem Q, where we replace the classical likelihood by a cross entropy objective, 
the integrability constraint by a constraint on the rectangle rule for the estimator, and 
the total variation penalty by a discrete version using difference matrices. 

While our histogram trend filtering approach to density estimation might seem closely 
related to the estimator from Silverman ( 1982]|, there are two significant differences. 
First, as pointed out by Sardy and Tseng (2010), the estimator given by problem (0 
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tends to over-smooth, since non-smoothness is penalized more heavily at low density 
values than at high density values, which may lead to uneven smoothing. The authors 
Sardy and Tseng] (|2010|| address this problem by imposing a a total variation penalty. 


m 


Thus, giving rise to the problem 


minimize log(/i) + A J27=2 \fi “ /*-il 

subject to f = 1 


( 10 ) 


for some integration coefficient vector a and parameter A > 0. The main motivation for 
this problem is to avoid the over-smooth solutions from solving (0. Hence, given the 
flexibility of imposing || • \\g, our histogram trend filtering estimators are also expected 
to avoid over-smoothing by taking p = q = 1. However, the other important issue 
associated with the estimator given by (j/jl is the computational complexity. This is also 
shared by the estimator from Sardy and Tseng ( 2010[ > since both of these procedures 
require to estimate a vector in R^. In contrast, we solve optimization problems in a 
significantly lower dimensional space, . 

Next we observe that by its mere definition in Problem when p = q = 1, our den¬ 
sity estimator provides piecewise polynomial solutions in the log-space. Here, the pa¬ 
rameter k in the difference matrix indicates the degree of the polynomial approximation 
used. For instance, fc = 0 corresponds to piecewise constant solutions, while /c = 1 to 
piecewise linear solutions. An attractive feature of our method is that it is not necessary 
to specify the the locations of break points; this is done adaptively by solving a convex 
optimization problem. In contrast, Barron and Sheu ( 1991| consider fitting splines in 
the log-space but this requires specification of the locations of the of the knots. More- 
Barron and Sheu (1991| provides rates of convergence for such spline estimators. 


over. 


in terms of the Kullback-Leiber divergence, when the true density satisfies 


(log/o)("+'^ 


< CX) 


( 11 ) 


with the superscript (fe-|-l) denoting the (fe-l-l)-the derivative. While we do not explicitly 
require this condition on the true density for the subsequent analysis, we do work we 
spaces of densities for which 

j I (log/)^^+^^ 1"^ < oo, 

and q G {1,2}. Thus our framework includes the less restrictive case p = 1. In fact, when 
p = 1, our result in Theorem [^provides convergence rates for our discrete estimator and 
this directly incorporates the smoothness of the true log-density captured by a difference 
operator. This is of interest given that structure is lost when we move from p = 2 to p = 1 
since Li does not come with an inner product. 

Finally, we review the penalized estimator from Willett and Nowak ( |2007| >. This is 
obtained by solving the problem 
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( 12 ) 


fw = argmin EHi log(/(2/i)) + pen(/) 

/ 

subject to J f = I, f G C 


where C is a class of non-negative piecewise polynomials and pen(/) is a functional that 
penalizes the complexity of polynomials. The solution to ([T^ enjoys attractive theoreti¬ 
cal properties, Willett and Nowak ([2007 1 shows that if /o is a member of the Besov space 


5“ (Lp([0,1])) where a > 0, 1/p = a -|- 1/2 and 0 < p < q, then. 


E 


II A/2 _ A/2, 

WJo Jw I 


< c 


9 / N \ 
2f„\\ 2a+l 


log2(n) 


n 


Moreover, the estimator fw involves using recursive dyadic partitions in order to pro¬ 
duce near-optimal, piecewise polynomial estimates, analogous to the methodologies 
Breiman et al. (1984[|; Kolaczyk and Nowak (|2004[| and Donoho et al. (1997||. Also, 


m 


this multiscale method provides spatial adaptivity similar to wavelet-based techniques 
( [Donoho et al. 1995 Kerkyacharian et al.| 1996| l, with a notable advantage. Wavelet- 
based estimators can only adapt to a functions smoothness up to the wavelets number 
of vanishing moments; thus, some a priori notion of the smoothness of the true den¬ 
sity or intensity is required in order to choose a suitable wavelet basis and guarantee 
optimal rates. The estimator fw, in contrast, automatically adapts to arbitrary degrees 
of the functions smoothness without any user input or prior information. However, 
this penalized method requires elevated computational effort. Specifically, it it involves 
0{n log2(n)) calls to a convex minimization routine and O (n log2(n)) comparisons of 
the resulting (penalized) likelihood values. The goal of this paper is to provide a com¬ 
putationally efficient estimator that can adapt to different smoothness of the true density 
and that comes with statistical guarantees. 


3 Main results 

3.1 Variational formulation and rates of convergence 

Now we present rates of convergence based on viewing (|^ as a variational problem. To 
that end, we observe that ^ can be thought as a discrete approximation to 

minimize - Xi g {f,i ) 

9 

subject to / < t 

f dp = 1. 

This formulation is more general than that of ([^, since one can always choose the 
number of points in each bin to be equal to one and replace the mid-points (^j)j^i by 
the actual observations When n is relatively small, we recommend just this. 

However, when n is large this becomes computationally burdensome. Since we should 
expect to lose information by considering counts within bins instead of the observations. 
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it is natural to ask if we can provide error bounds for both situations and to see how these 
differ. Our next theorem gives light on this point. 

To state such result, we first introduce some notation and make some assumptions. 
These assumptions are designed to ensure that the set over which we constrain the min¬ 
imization is compact with respect to the supremum norm. The idea of proving con¬ 
sistency results for non-parametric maximum likelihood estimators over a sequence of 
reduced spaces is known as the method of sieves. This technique was introduced by 
Geman and Hwang (|1982 1 and has been further studied in the literature in |Birge et al. 
( [1998[ l, Shen and Wongl T994| |, and Shen (19971. 

Given a funchon h with domain Q, we say that h is L-Lipschitz if \h{x) — h{y)\ < 
L\x — y|" for all x,y ^ Q. The order-Z weak derivahve of h is denoted by The set of 
log-Sobolev densities in = (0,1) is 

V := S^h : /i E (!d) , ^ = 1, | 

where dy. denotes the Lebesgue integral and (Q) is the Sobolev space of order 


k + l,p in Q. Recall from Oden and Reddy (2012 1 that Vk”^’P(A), the Sobolev space of 
order m, p on a bounded set A c M, is defined as the set of functions rt : A —>• M such that 




,(“)IP 


1/p 


< oo 


for all a E {0,1, ■ ■ •, nT-}, where is the a-th weak derivative of u. 

On the other hand, for an open set A c M, we denote by G(A) as the set of continuous 
function with the finite suppremum norm 


II^^IIl“(A) = sup|M(a:)|. 

xGA 

Note that the C (A) do not require its elements to be uniformly continuous but only the 
relaxed condition that they are bounded in the open set A. Finally, for a set S' C G(A) 
we denote by c1a(S) its closure in G(A) with respect to || • ||l°°(A)- 

Assumption 1. We consider In, Tn be positive sequences of numbers to be defined later. For 
now we only assume that these sequences are bounded by below. 

Assumption 2. We assume that the true density /o has support in [0,1] and is r-Lipschitz for 
some positive constant r. 

Definition 2. We define the sets Sn,i and Sn ,2 as 

Sn,i = {e^ : h£ logSn,i] 


with 


logSn,i = cln [V n <h : 


L°°{Q) — 




Lqo) 


< Tn, h is continuous 
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Definition 3. For i = 1,2 we construct 

^ni ~ <5^,1 n {e^ : h£C{Fl), h is r e~'^"-Lipschitz, > ?„ } 

Using the assumptions and notation above, we study j, the solution set of the 
problem 


minimize - V {log (/) {y^) dp} 

3 (13) 

subject to f £ Sn,i, 

for i = {1,2}. We also consider the case when Sn,i is replaced by ^ and the objective 
is replaced by a weighted likelihood 


minimize — {xj log (/) (U) dp] 

9 “ 

j=i 

subject to 


(14) 


in which case the respective solution set is denoted by M'^ ■. We proceed to state 
convergence rates proved using entropy techniques as in|Van de Geer (1990|; Mammen 
( |1991[ l; Wong and Shen ( 1995[ l; jGhosal and Van Per Vaart ( |2001| |. 


Theorem 4. Part A. Assume that = 0((logn)^'^) where q > 1/2. If there exists a 

sequence qn ,2 £ Sn ,2 and 


Mh){qn, 2 {p)} f/o(/u)^^^ - dp = 0{l/n). 


Jn L 

Then, M '^2 empty, and if Dn = for 1 < s with > n", 0 < a < 1/s, 

then there exists positive constants ci, C 2 and C 2 (independent of fo) such that for large 
enough n. 


p* 


sup d f/n,/o) > C' 2 {/o^(n)}'?n 

U^K,2 


< C2exp [—Cl {log{n)}‘^'^ “] . 


Moreover, if we replace Sn ,2 by 5^ 2 ' set a = 1, we obtain the same concentration 
bound for M^p- 

Part B. Suppose that Dn = with s > 1, (T„ e^") = O{rf)for some r £ (0,1/2), 

and there exists qnp £ Sn,i such that 


fo{p){qn,i{p)} \fQ{pY/‘^ - qnpipf^"^] dp = 0{l/n). 


9 






























If Dn = for 1 < swith Q < a <1/s, then, for all 

f , 1/2 — r 

0<*<mi”|°A l + 2/(t + l) 

we have that i 7 ^ 0 and 

P* sup 
\fn&K. 

with Cl a positive constant independent of Jq. Moreover, the same conscentration bound 
holds for Mn,i without requiring the conditions imposed by a. 

The existence of the sequences qn,i in Theorem|^is meant to impose regularity condi¬ 
tions on the true density /o that allow it to be sufficiently well approximated by elements 
of the sieve. Thus,we do not require that /o a smooth function but rather that can be well 
approximated by our sieves. On the other hand, we can think of the sieves as class of 
continuous functions that are well approximated by smooth functions. In Sn,i the con¬ 
straint given by || • is designed to enforce smoothness. Moreover, taking 

Tn= 2q log(log(n)) -h 2 "Mog(C') 

with a constant C > 0 ensures that the elements of Sn ,2 will take values in 




((logn)-29 0-1/^ (logn)2'? 0^/2) 

which approaches to (0, 00 ) as n — 00 . A similar statement can also be made about the 
sieve 5 ^, 1 . 

Note that when T„ is constant and /o G Sn, 2 , then, the first part of Theorem [^implies 


lim sup P* 


sup d (fn, fo) > C 2 {\og{n)y n 


0 . 


A related result for log-spline density estimation was found in Barron and Sheu ( |1991| > 
for the Kullback-Leibler divergence when /o in the log-space space belongs to a Sobolev 
ball. This then implies that in terms of the Hellinger distance, the estimator pn from 
Barron and Sheu|(|1991| satisfies 


r 2fc+2 - 

lim lim sup P d{pn,fo)>Kn 2(2fe+3) 

K^oo n—^oo L 


^ 0 . 


where 

B = |/ : max III log(/(^+^))||i, 2 ([o,i]), || log(/)||i,oo([o,i])} < c| 

for a positive constant c. Thus, Theorem [^provides surprising convergence results for 
the sieve Sn, 2 - Moreover, the second part of Theoremaddresses the case in which the 
true density can be approximated by functions in a sobolev ball in IP^+^T This differs 


10 




















from previous work given that spaces are not Hilbert spaces, and hence the analysis 
from Silverman (|1982| is not applicable. 


3.2 Discrete estimator 


In the previous subsection we characterized the variational estimator corresponding to 
histogram trend filtering. Next we focus on the version of the estimator where we ap¬ 
proximate the function on the discrete grid. This is necessary for finding the solution by 


numerical optimization, and is analogous to the approach taken by Koenker and Miz- 


era 


( |2007[ >, who started with a variational problem and then moved to an approximate 
solution on a grid. However, they did not provide any statistical guarantees for either 
of their formulations. 

We now denote the regularization parameter as and define the vectors 


:= {logn - logH„ log/o(6), • • •, logn - logH„ log/o(^D„)} (15) 

and 6 as the solution to Problem (j^. Thus up to a known constant of proportionality, 9^ 
and 6 are the true and estimated log densities, respectively. 

We are now ready to state our next consistency result. Its proof can be found in 
the appendix and is inspired by previous work on concentration bounds for estimators 
formulated as convex-optimization problems (e.g. Ravikumar et al. 201 Of Tansey et al. 
2 ^. 


Theorem 5. Let p = q = l or p = q = 2 and s > 2 and assume that the true density fo 
is L-Lipschitz for some constant L > 0 and Supporedi^ff) = [0,1]- Suppose that we choose 
Dn = & Then there exists a constant r > 0 and a function f satisfying 


0 < Urn n < oo, 

n^oo 

such that if we choose Tn < r then 


Dn - J 


< exp {—(j){n)}. 


Theorem is quite general, in that it establishes consistency under the assumption 
that the penalty parameter is relaxed at a sufficient rate. However, it does not explicitly 
incorporate the role of the penalty function in ^ in establishing the accuracy of the 
method. Our final result provides a bound on estimation error that does refer to the 
penalty explicity. This, unlike Theorem can be extended to densities of unbounded 
support. 

Theorem 6. Let be the point in Ij satisfying 

6nh{i'j) = fo{t)dt. 
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Let us also take b G (0,1/2) and define 9 as the solution to the convex optimization problem 


minimize J2f=i 

subject to \9j — log{n6n)\ < nf, j = 1,..., D^- 
Let us assume that (foiCj), ■ ■ ■, /o(?/>„)) belongs to the constraint set q/Q Then 

exp ( 9j 


( 16 ) 


no = 


n 6ri 


j — 1 ) • ■ •) T)n, 


satisfies 


Dr, 


i=i 


' Up ' 

, no. 


= Op 


(A(^+i))' 


Dr 

■^n 


-||A("+i)log (/o(0) lli + 


ni/2-fe 


(17) 


(n^/'^) and r = 0 (A(^+^^) ||oo^/ where s > 1 and 


where we choose Dn = 0 
r G (0,s/2). 

Theorem states convergence rates for our Poisson surrogate model estimator. In 
particular, the bound in ( [T7) | controls the Kullback-Leibler divergence between our es¬ 
timator and a discretized version of the true density. Moreover, the constraint on the 
supremum norm in the log-space space ensures that the optimization is over a compact 
set. Hence, it is not restrictive given that this bound tends to infinity as n increases. 


Finally, we emphasize that 


(A(^+b) 


proof of Corollary 4 in Wang et al 
(.1474, .1482) if if Dn is chosen between 500 and 10000. 


= 0{Dr, 


This is a consequence of the 
|2014|(. In practice we have found that || (A^^'^^)) ||on D, 


-1 


3.3 Model selection 

We know turn to the discussion of the parameters Dn and r when p = q = 1. For 
the former of these parameters, we see from the results in the previous section that 
Dn = Op/^), for s > 2, seems a reasonable choice. In practice we have observed that 
the rule Dn = 10 performs excellently and hence it becomes our default choice. 
On the other hand, for the choice of r, we see from Theorem]^ that 

r = 0(^n^-^/"|| 

is a candidate choice with r satisfying the constraint from Theorem We will see in 
the next section that this performs well in practice as well. In particular, this choice 
ensures consistency for the trivial case in which the true density /o is uniform, the precise 
definition of the universal penalty required in Sardy and Tseng ( |2010 >. 

Finally, on the choice of r, we also consider an add-hoc rule inspired by the work 
of Tibshirani and Taylor (2012| on regression problems with generalized lasso penalties. 
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This consists of computing the solution path of the problem and then considering a 
surrogate AIC approach by computing 


AIC^ 


lipr) + A: + 1 + 


[i : ^ 0 } 


The parameter r is then chosen to minimize the expression above. 


4 Examples and discussion 

4.1 Comparison with kernel methods 

We conducted a simulation study to examine the performance of histogram trend fil¬ 
tering versus some common methods for density estimation. Our first example is a 
three-component mixture of normals 

h{y) = 0.9A(2/ I 0,1) + 0.1A^(y | -2, 0.1^) + 0.1A(y | 3, O.S^) 


shown in the top left panel of Figure]^ The second example is a five-component mixture 
of translated exponentials: 

7 

f 2 {y) = '^Wc Ex{y - me I 2), 

C=1 


where the weight vector is tc = (1/7,2/7,1/7,2/7,1/7) and the translation vector is 
m = (—1,0,1, 2, 3). Here Ex{y \ r) means the density of the exponential distribution 
with rate parameter r. This density is shown in the top right panel of Figure]^ 

Our simulation study consisted of 25 Monte Carlo replicates for each of six different 
sample sizes: n = 500, 1000, 2500, 5000, 10000, and 50000. For each simulated data set, 
we ran histogram trend filtering with k = 1 and k = 2. We benchmarked the approach 
against three other methods: kernel density estimation with the bandwidth chosen by 
five-fold cross-validation, kernel density estimation with the bandwidth chosen by the 
normal reference rule, and local polynomial density estimation with smoothing parame¬ 
ter chosen by cross-validation. In the reference-rule version of kernel density estimation, 
the bandwidth is chosen to be 0.9 times the minim um of the sa mple standard deviation 
and the interquartile range divided by 1.06n“^/® (Scott 1992 1 . We used the version of 
local polynomial density estimation implemented in the R package locf it. 

Tables[^and|^show the average mean-squared error of reconstruction of all methods 
for both /i and f 2 - Order-1 trend filtering has the lowest mean-squared error across all 
situations. Figure [^provides a detailed look at the two simulated data sets. The top two 
panels show fi and /2 together with a single simulated data set of n = 2500 from each 
density. The middle two panels show the reconstruction results for histogram trend 
filtering with k = I, while the bottom two panels show the reconstruction results for 
kernel density estimation with the bandwidth chosen by cross validation. The trend¬ 
filtering estimator shows excellent adaptivity: it captures the sharp jumps in each of the 
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Example 1: data and true density 


Example 2: data and true density 



Histogram trend filtering 


Histogram trend filtering 



Kernel density estimation 


Kernel density estimation 




y 


y 


Figure 1: Top two panels: the true densities fi (left) and /2 (right) in the simulation 
study, together with samples of n = 2500 from each density Middle two panels: results 
of histogram trend filtering for the /i sample (left) and the /2 sample (right). Bottom two 
panels: results of kernel density estimation for the fi sample (left) and the /2 sample 
(right). In the bottom four panels the reconstruction results are shown on a log scale. 
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Table 1: Mean-squared error x 100 on example 1 for histogram trend filtering with 
k = 1 and k = 2 versus three other methods: kernel density estimation with bandwidth 
chosen by cross-validation, kernel density estimation using the normal reference rule, 
and local polynomial density estimation. 


n 

RTF {k = 

1) 

RTF (fc 

= 2) 

KDE (CV) 

KDE (ref) 

LP 

500 

2.5 


4.9 


3.1 

4.0 

3.3 

1000 

1.8 


2.8 


2.2 

3.8 

2.3 

2500 

1.3 


1.6 


1.7 

3.3 

1.6 

5000 

1.1 


1.1 


1.3 

3.1 

1.2 

10000 

0.7 


0.7 


0.9 

2.8 

0.9 

50000 

0.3 


0.3 


2.5 

2.2 

0.4 

Table 2: 

Mean-squared 

error x 100 on example 2 for the same five methods 

n 

RTF (fc = 

1) 

RTF(fc 

= 2) 

KDE (CV) 

KDE (ref) 

LP 

500 

5.7 


6.8 


5.5 

8.8 

6.2 

1000 

4.0 


4.6 


4.5 

8.5 

4.9 

2500 

3.0 


3.3 


3.7 

7.9 

3.5 

5000 

2.4 


2.9 


3.2 

7.6 

2.9 

10000 

2.0 


2.9 


2.8 

7.0 

2.6 

50000 

1.6 


2.9 


6.1 

5.9 

2.3 


true densities, without suffering from pronounced undersmoothing in other regions. 


4.2 Comparison with other penalized methods 

In the two previous examples we have considered comparisons versus estimation meth¬ 
ods that scale well with the number of samples. We now conclude with an example 
comparing our histogram trend filtering versus other penalized methods that face prob¬ 
lems with large numbers of samples. These methods are the the penalized likelihood 
approach from [Willett and Nowak ( 2007[ > (W-N), and the total variation approach from 
Sardy and Tseng (2010]l (TV) using their universal penalty. We also compare against 
the taut string method from Davies and Kovac (2004||, which is closely related to the 
estimator from Sardy and Tseng ( |2010| |. 

On the other hand, in the light of the previous two examples, we now only focus 
on two different variants of histogram trend filtering with k = 1. First, we compute 
the solution path of problem ^ and then we choose the tuning parameter with the 
surrogate AIC criterion described in Section 3.3 Secondly, we use the same criteria only 
on a grid of values. These values are 


{A 71 OO, A 71 O, A*, A* 10, A* 100}, 

where A* = n|| ||iiA“^. Our motivation here comes from the statement in 

Theorem 1^ 


We use these two variants of our method by borrowing a density from Willett and 
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Data and true density 


Absolute value of error for HTF (k=1) 


CO -1 


CD - 


^ — 



0.0 0.2 0.4 0.6 0.8 1.0 


Absolute value of error for TV solution 



o 

CO 

m 

c\i 

o 

c\i 

in 


o 



0.0 0.2 0.4 0.6 0.8 1.0 


Absolute value of error for W-N 



Figure 2: The left panel at the top shows the data with the true density with sample 
size n = 5000. The right panel at the top shows the plot of the vector obtained by taking 
the absolute value of the difference between HTF(k=l,using the full path) and the true 
density. The same is done for TV and W-N in the two panels at the bottom. 


Nowak (|2007[> that consists of a mixture of beta distributions. Figure [^illustrates a plot 


of this distribution. The explicit density is defined as 


fo{x) = ^ 4000,4000)) + ^ (Unif[o,i](x)) + ^ (Unif[4^i](x)) , 

where refers to a Beta distribution shifted and scaled to have support on the interval 
[a, b] and integrate to one. We use this density to generate data for different sample sizes. 

The results in [^ show that our methodology outperforms in accuracy the competi¬ 
tors. This is also visualized in Figure [^ where we can see that W-N seems to provide 
better recovery that HTF in areas where the true density behaves as smooth pol}momi- 
als. However, HTF seems to be more reliable in areas where the true density changes 
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Table 3: Mean-squared error x 10 on example 3 


n 

HTF {k = 1) 

HTF(k=l,no 
full path) 

TV 

Taut string 

W-N 

500 

1.4 

1.5 

30 

22 

3.7 

1000 

1.0 

1.1 

19 

7.4 

1.1 

2000 

0.4 

0.5 

11 

3.2 

0.5 

4000 

0.2 

0.2 

5.6 

2.1 

0.3 

5000 

0.2 

0.2 

4.9 

2.1 

0.3 



Table 4: Time in seconds on example 

3 for the different methods 

n 

HTF (k = 1) 

HTF(k=l,no 
full path) 

TV 

Taut string 

W-N 

500 

1.03 

0.02 

6. 85 

0.02 

1.15 

1000 

1.15 

0.03 

18.2 

0.02 

4.54 

2000 

1.32 

0.04 

45.3 

0.03 

22.0 

4000 

1.45 

0.06 

136 

0.07 

113 

5000 

1.85 

0.09 

237 

0.10 

202 


drastically. 

On the other hand, from Table it is clear that HTF is more efficient than W-N and 
TV which begin to have considerable problems to scale. Even computing the approxi¬ 
mate solution path for HTF(k=l) seems hundreds of times faster than solving a single 
problem for other penalized method. 


5 Conclusion 


In summary, we have shown that histogram trend filtering can be successfully applied 
to the problem of density estimation. This estimator enjoys both computational and the¬ 
oretical attractive properties. On the computational side, our experiments suggests that 
histogram trend filtering scales remarkably well with sample size, and that in practice it 
is just as computationally efficient as widely used methods based on kernel density esti¬ 
mation (KDE). However, unlike such methods, histogram trend filtering does not suffer 
from simultaneous over- and under-smoothing. Rather, our estimator can easily adapt 
to different levels of smoothness of the unknown true density. 

Many methods have been proposed in the literature to deal with the problem of lo¬ 
cal adaptivity, e.g ( [Willett and Nowak 2007 Sardy and Tsengj 20101. As our paper has 
shown, these methods face challenges specifically in regions where the smoothness of 
true density changes rapidly. We have shown that histogram trend filtering can bet¬ 
ter adjust to such situations, while overcoming the scalability problems also inherent to 
other penalized methods. Thus histogram trend filtering enjoys both the computational 
efficiency of KDE methods and the adaptive properties of penalized estimators. Finally, 
our risk bounds provide strong theoretical guarantees of good performance for his¬ 
togram trend filtering when seen as a variational problem or by its convex optimization 
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formulation. This combination of practicality with strong statistical guarantees makes 
histogram trend filtering an ideal candidate for use in routine data-analysis applications 
that call for a quick, efficient, accurate density estimate. 


A Proof of technical results 

A.l Proof of Theorem [1] 

Proof. Let us assume that 9 solves (j^. Then, we define gi = 6i — log(n(5n) and c = 
||^(fc+i)^||P_ Hence from the KKT conditions i is equivalent to 

minimize {exp(6'i) - Xi6i} 

subject to < c. 

Now, with the change of variable g = 9 + log(n dn) and dividing by n this is equivalent 
to 


minimize 6n E&i exp( 5 i) - ^ 9i 

subject to < c. 

Next we define the function 


Dn ^ Dn 

G{g) = exp(fi(i)- y^Xigi. 

2=1 2 = 1 


and for an arbitrary g G we define g' G as 


Dn 

9i = 9i- log j 5n ) • 

i=i 


Then 


Dn 


Dn 


G{g') = G{g) + 1 - ^ exp( 5 rj) + log 5n ^ exp(c?j) < G{g) 


i=i 


i=i 


since t — log(f) > 1 for all f > 0. Moreover, 

||A("+i)5||? = ||A("+V||?. 


Therefore, problem ( [18] | is equivalent to 

minimize - ^ 9i 

subject to 5n Ya=i exp(fi'i) = 1 
IIAl^+ilffll^ < c 

and the claim follows. 


(18) 
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□ 


A.2 Proof of Theorem IH 

Proof. We now focus on the proof of Theorem IH Since this requires several steps, we 
start by introducing some notation. Let 5 be a set of integrable functions with support 
[a, b] and ds a metric on Li (M). For a given <5 > 0, we define the entropy of S, denoted by 
N {6, S, ds), to be the minimum N for which there exist integrable functions /i,... /n 
satisfying 


min ds{fi,g)<6, \/g £ S. 

Ji 

The (5—bracketing number {6, S,ds) is defined as the minimum number of brackets 
of size 6 required to cover S, where a bracket of size 5 is a set of the form [l,u] := 
{h : l{x) < h{x) < u{x) Vx}, where I and u are non-negative integrable functions and 
ds{l, u) < 6. 

From now on we denote by d the distance 


d{g,h) = 


f dp 


1/2 


Moreover, for an open set Tl c M we denote by the set of of m-times differen¬ 

tiable functions for which the derivatives of orders less than or equal to m are uniformly 
continuous. 

Next recall that the Sobolev space W"^’P(Q) is endowed with the norm 


\ i/p 

where is the j — th weak deriuvative of u. 

In what follows we focus on the proof for the minimization over Sn, 2 - For the case 
we then briefly describe the corresponding modification. The proof for i = 1, 2 
is analogous. 

Lemma 7. There exists a constant Jo > 0 such that ifO<5<6o, then for all n we have 


luWw^’Pin) 


log {N{6,Sn,2, 
log {N{S,Sn,l, 


.)) < Alog(^i±^^f^^ 




where A and B are positive constants. 
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Proof. For e > 0 we first define the set 


5(e) = I + e 

I|5||l°°([o,i]) < 7n + e }, 

Then from Example 2.1 in Van de Geer ( 1990| , because r„ is bounded by below, there 
exists So such that 5 G ( 0 , (5o) implies 


log (V ((5,5(e), II • Iloo)) < ^log 


Fn + e 


Next let e > 0 fixed. Then if / G Sn, 2 , by definition, there exists h such that log(/i) G 
lV^+h 2 (ii) and 


and 


11/ - ^IIl°°(o) < ^ 

max{|| log(/i)||icx=(t^), II log(/i(^+i))||| 2 (t^)} < Tn. 


Since G*''''^(0) is dense in VF^+^’^(n), e.g (Adams and Fournier| 2003 Oden and Reddy 


2012|, by the Sobolev embedding theorem there exists 5 G 5(e) such that 


II5 - log(/i)||ioo(Q) < 6 . 

Let us now setN = N (S, 5(e), || • ||oo) and let gi,..., gjy G 5(e) be functions such that 
for every g G 5(e), there exists i G {1,..., A^} satisfying 


II9 “ fi'i||L°°([0,l]) < 

Then for / G 5„,2 choosing h as before and gi G 5(e) such that ||g'j — log(/i)||^oo(Q) < 2S 
we obtain that for all x G (0,1) 

|/(x) — < max (/i(x), |log (/i(x)) — g'i(x)| + 5 

< max (/i(x), e^°s(^(^))+2 5) 2 (5 + (5 

< (1 + 2ee'^")(5. 

Therefore, for all e > 0 

log(V(<5,5„,2,|| • Iloo)) < log ( (l+2ee^n) ^>g(^). II ' lU)) 

Letting epsilon go to zero we arrive to 

log {N {S, Sn,2, II • Hoc)) < 241 og (^ Tnil + 2ee^-) ^ 

Hence the result follows for Sn, 2 - The proof for the sieve 5n,i follows the same lines 
with entropy bound for the corresponding 5 coming from the proof of Theorem 2 in 
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Mammen ( |1991 >. 


□ 


Corollary 8. With the notation from the previous lemma, there exists S'q such that 0 < S < S'q 
implies 

log {N[] {5, Sn, 2, d)} < 


+B 


2T„ (e^" 2e+l) 


“ 5 ^ 


Proof. Let j e {1,2}. We proceed as in the proof of Lemma 3.1 from [Ghosal and Van| 
Der Vaart |200l| >. First, given 5 G (0, do), we define p = Next, let /i,..., /tv be 


non-negative integrable functions with support in [0,1] such that for all h G Sn,j, there 
exists i G {1,..., N} such that ||/i — /i||L«)(o) < p- We then construct the brackets [k, Ui] 
by defining 


k = max{fi - p,0), Ui = {fi + p) 
Then Snj C Ui]. Since 0 < Ui — k < 2p, we obtain 


Therefore, 


foo rl 

\ui{p) - li{p)\dp = / \ui{p)-li{p)\dp<2p. 
—oo J 0 


iV[] {2p, Snj, II • 111) < N. 

The results then follows from the previous lemma by choosing p = (2 5^, implying 

that 


log{iV[] 


< log{Ai[] ((52,5n,j, II • 111)} 

< logjiV II • Iloo)} • 


□ 


Existence We now show that the sets Mn,i and M'^ ■ are not empty. To this end, note 
that in ((7(11), || • ||loo(o)) we have that 


{/i : G Sn,i} C do {V PI C'(n)) Pi do (^h : h £ h is continuous^ 

hence by the Sobolev embedding theorem we obtain that {/i : G Sn,i} is compact 

in (C(fl), II • IIloo(o)). Similarly, {/i : G SW} is also compact in [C{Q), || • ||Ltx>(o)). 


Rates We conclude the proof by using Theorem 1 from Wong and Shen ( 1995| . First, 
we observe that if a G (0,1], then e„ = (logn)'^ satisfies 
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(iV[](n/c3,5„,2,d))d^ < 


21/2^1/2 


C4 n 


1/22 



f2Tn (2e^" e+l)c§2i® 

I 4 


1/2 


for large enough n where B is some positive constant and C3 and C4 are given as in 
Theorem 1 from Wong and Shen ( 1995[ |. Hence, the claims follows for Sn, 2 - 

To conclude the proof for the sieve 5^ 2/ observe that for any observation yj there 
exists ^ji such that yj and ^ji belong to the same bin. With an abuse of notation we will 
denote such ^ji as ^j. Then for positive constant c, if n is large enough we have 


p* < 

sup d(fn,foj>en 

> < inf P* < 

sup nj=i Kij)/g{ij) > exp {-cnel) 



9^4,2 

heS'^ 2 '■ 


But for any g,h e S'^ 2 have 

n n 

n Kij)/9{ij) = n {hiVj)/9{yj)] {g{Vj)/g{ij)] {h{ij)/h{yj)] 

i=l j=l 

and by the Lipschitz continuity condition. 


n”.iftfe)A(!/j) < {1 + 434} 

' n 


< (1 + ^)^ 


Similarly, 

is not empty. 


Then for large enough n. 


'[lgiyj)/g{Cj) < (i + 


i=l 


'f \n 




1 sup d ( 

fnjo) >en 


) 


9^S'^ 


n,2 


—2 nlog (1 + rjn^) } 


sup U.]=i Hyj)/g{yj) > exp{ - cni 


< inf P* 


sup 


964,2 I /ies;_ 2 :d(h,/o)> 6 , 
< 6exp (-C2ne2) , 


OLi Kyj)/g{yj) > exp {-cmel) 


if 0 < c < Cl where ci and C2 can be obtained from Theorem 3 from [Wong and Shen 
(|1995 1 and P* is understood as the outer measure corresponding to /q. 
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Finally, we replace Sn ,2 by Sn,i, and set e = as in the statement of the theorem. 
Then the same argument from above shows that the solution set i is not empty. A 
similar argument as above leads to the desired conclusion for both sieves 5^,1 and S'^ ^ 

’□ 


A.3 Proof of Theorem [H 

Throughout we define the vectors 

90 (0 = {log/o(6),---,log/o('^D„)} 

aiO = {iog/(?i)> • ■ ■ >iog/(CD„)} 

Proof. We first prove the case where p = q = 1. To that end let fci > 0 and mi > 0 be a 
lower bound and upper bound on the true density. Let us also define 

9 = argmin |/( 6 ») + rn||, 

We consider the function G : —)> M as 

G{u) = l{0^ + u)- l{0^) + Tn (||A(^+bn + A(^+b0O||i - ||A(^+b90||i) , 

and we show that with high-probability this function is strictly positive in the boundary 
of the Euclidean unit ball. The result will then follow because G is convex, G(0) = 0, 
and G{0 — 0) < 0. 

To show this we first observe that by the taylor's expansion, there exists aj G [0,1] 
such that 


1(0° + u)- 1(0°) 


> 


^ |exp(0° Uj) - Xj (0° + Uj) + Xj 0° - exp(6'°)| 

Dn 

I ttjexp(6'^) - UjXj 2“^exp(6'° Ojufuj V 
— ||exp(9°) — x||^ Dn‘^\\u \\2 + 2~^nD~^ ki 


2 

2- 


On the other hand. 


Tn (||A(*^+bii + A(^+b6)0||^ _ ||A(^+b6)0||^^ > 

> 

> 

> 


-Tn IIA^^'+biilli 
-Tnk4, ||A(bii||^ 
-Tnki ||A(bii||2 

-TnDl/"^ k2 \\u\\2 


for positive constants k 2 , k^ and k^. Therefore, if ||tt||2 = 1 we obtain 
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G{u) > — ||exp(0°) — x||^ + 2 ^nD^^kie ^ — TnDU‘^k 2 > 0 

if only if 

n ki e"^ > 21)^2 {||exp(6»°) - x||^ + k 2 Tn] ■ 

Next, we observe that by the mean value theorem for integrals we have Xi ~ Binomial{-D“^/o(2;i), n} 

for some Zi in bin i. Then for any f > 0 and f G {1,..., Dn], using Chernoff's bound we 

obtain 


P{|xi - nT),, Vote)| >t}<F {\xi - fQ{zi)\ > t - \nD^^ fo{^i) - n fo{zi)\} 

< exp{-e2(2 + e)-^nD-^ fo{zi)} + expj-e^ 2 -^nD-^ foizi)}, 


if e > 0 where 


e = tn '^Dnfoizi) - foizi) \M(i) - fo{zi)\■ 


We choose t = Cn Dn for some constant C > 0 and observe that the respective e is 
positive for large enough n. To see this we observe that 

e = Mzi)-^CD-^/^-foizi)-^\fom-Mzi)\ 

> fo{zir^(CD-^/^ -LD-^" 

Hence for large enough n we see that 

Dn 


pr (^||exp(6'0 )-x||^ > < ^pr ||xj - ni/o(y)| > ncj 

I nD - nDn^ foizi) 


i=i 

Dn 

< J^exp 

i=i 

Dn 

+^exp 

i=i 

Dn ( 

< J^exp<^ 

j=i I 

Dn 


n, nPn^^'^C MU) ^ 

"Dn^foM) fo(^i) 


nPn'^^^C 
riDn^ foizi) 


/ote) 

foiu) 


- 1 


} n2 ^D^^fo{zi) 


fo{zi)-^(c Dn'^^^-LDn^y nD-^ foizi) 

n I I f(U) -I I 

/o(s) |/o(s) I 

-,- 1/2 


Tj^exp <1 -2-i/o(^i)-" (CH-'/'-LH-i) nD-^Uzi 
/=i 


Therefore, if Dn = an as in the statement of the theorem, then 
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pr |||exp(0°) — x||^ > nDn^^‘^ C 


Dn 

} < J^exp 

i=i 

Dr, 


/o(2i)-^(c-LD„^/^)''nD-2/o(^i) 
1-1/2^ 


2 +- 


fo(^i) 


i=i 


+^exp I -/o(zi) 2 - LZ)n n2 fo{zi) 


-1 1.-1 


Hence we set C = 4 ^ cfci e ^ for some c G (0,1), choosing r = 4 ^ (1 — c) fci e ^ ^2 
ensures that with high probability. 


n 


he ^ {||exp(6»°) - Xj\\oo + k 2 Tn} ■ 


If, on the other hand, p = q = 2, then the proof follows the same lines, with the main 
modification involving the following bound: 

Tn - ||A(^+^)6»°|||} > Tn ( \\A^'^+^'>u\\l - 2 ||A(^+^)m||2 ||A(^+^)6»°||2) 

> -T,,2||A(^+i)5(0ll2||A(^+i)n||2 

> -Tn A:5 ll^COlb ||A(^+^)u||2 


□ 


> -TnhDl/^\\u\\2. 

for some positive constants h and h. 

A.4 Proof of Theorem [6] 

Before beginning the proof of the claim we start by proving an auxiliary lemma. 
Lemma 9. With the notation from Theorem^ if a e then 


P ^1 (x — exp(0‘’))" 


a\ > 


^ 11 ^ 11CX) 

Dr 

'tn 


< 4 exp ( —c 


n 

'ff2r 

'n 


for allr > 0 and some positive constant Cr depending on r. 

Proof. Our proof is inspired by the construction in Lemma 3 from Devroye ( |1983 >. We 

start by denoting pi = —i = 1,..., Dn- Then we can think of Xi as the occurrences 

of value i among ui,..., Un where P(tifc = j) = pj for j = 1, ...Dn and k = 1,2,_ 

Next, we define N ~ Poisson (n), and x' as the occurrences of value i among ui,... ,un. 
Clearly, x' ~ Poisson (np,). Moreover, 


D„ 




i=l 


< 


Dn 


(x- - npi) 


2=1 


+ 


Dn 


ai {xi - X •) 


2=1 


form which 

Dn 


Ywi (d - npi) 


2 = 1 


> 2e < P (||a||oo|A - n| > e) + P 


Dn 




2 = 1 


>e] (19) 
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for all e > 0. We now bound both terms in ( [l^ . First, we proceed using Hoeffding's 
inequality. 


(Siii ~ '^Pi) ^ f) ^ exp t + J 2?=1 '^Pi (exp(f tti) -1-t 

n (exp(t ||a||oo) ~ 1 ~ ^ II' 

+ n(exp(;^)-l-^' 


< inf exp {—et + n (exp(t ||a||oo) — 1 — t ||a||oo)) 



for some positive constant c if is large enough. Therefore, setting e = ci n ||a||oo Dn'^ 
with Cl > c, we obtain 


' Dn 


P {x'i - npi) > Cl 


n a 


vi=l 


Dr 


< exp 


(ci - c)n\ 
D2r j ■ 


With union bound inequality and repeating the same argument from above, we arrive 
to 


Dn 


P I {x'i - npi) I > Cl 


i=l 


^ 11 ^ 11 CXD 


< 2 exp — 


(c 2 - c)n 

~W^ 


Finally, from the proof of Lemma 3 in Devroye ( |1983| l we have 

^ 11 ^ 11 CXD 


|a||oo|A^ -n\ > ci- 


Dr 


I c? n 


and the result follows. 


□ 


Proof. Let ci an element of the canonical basis in and let us denote by P the orthogo¬ 
nal projection onto the row space of We start by noticing that from sub-optimality 
we have 


l{9) + A ||A(*^+P0||i < /(0°) + A ||A(*^+P0°||i. 

Hence, setting \ = tI2, we obtain 


foiCj) log 



< 


P (x - exp(0O))^ (^(A(^+P) A(*^+P + (^e - 0°) 

+ A(^||A(^+i)0O||i-||A(^+i)0||i). 

( 20 ) 


Next we bound each of the terms on the right hand side of ( [^ . First, define vi,..., Vk+i 
to be an orthonormal basis of such that vi = (1; ■ ■ •; !)■ Then, it is not difficult 
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_^ /2 

to see that these vectors can be chosen to satisfy Ht’j ||oo = 0{Dn ' ) for j = 1,...,A: + 1. 
Therefore, by Holder's inequality 


l{x- exp(0O)) (e - i Y!jtl [(a; - exp(0O)) \vj [9 - 9° 

< ^\\x- exp(6»°)||i (^11 log(/o(^'))lloo + II log(/(^') 

( 21 ) 

It follows form Lemma 3 in Devroye ( [1983| that 

i (x - exp(0O))'^ P^x (0 - 0°) = Op 


1 / 2-6 / ■ 


assuming that we constraint \\9 — log(n 6, 
On the other hand. 


n}\\OQ 


< n^. 


< 


2 (x - exp(0O))^ ((A(^+i))- A(^+i)) (0 - 0° 

2 II (x - exp(0°))^ (a(^+^)) IIoo (^IIA^^^+^^^^lli + ||A(^+^)0|| 

Moreover, from the previous lemma we obtain 

n II (A(^+^))” ||o 


( 22 ) 




(x - exp(0°)) 


1 ) 


Dr 


n 


< 4exp ( + log(Pn) 


Therefore, combining (20), (21) and (22), if A > || (x — exp(0°)) ||oo,then. 




+ 


1 

, 1 / 2-6 


(23) 

□ 
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